[SPARK-50520][PySpark] Respect timeout in df.rdd.countApprox()#56060

Open
rishav23 wants to merge 2 commits into apache:master from rishav23:fix-spark-50520-countapprox-timeout

@rishav23

What changes were proposed in this pull request?

PySpark approximate RDD actions currently call getFinalValue() on the PartialResult returned by Spark's approximate job APIs. That call blocks, so APIs such as countApprox(timeout=...) wait for the full job to complete instead of returning once the timeout elapses.

This PR changes PySpark to use PartialResult.initialValue(), which already contains the timeout-aware approximation computed by ApproximateActionListener.awaitResult().

Regression tests were also added to validate:

  • timeout-aware approximate behavior
  • exact results when computation completes successfully
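
To illustrate the difference between the two accessors, here is a minimal toy model in plain Python. It is not Spark's implementation: `ToyPartialResult`, `toy_count_approx`, and their parameters are hypothetical names invented for this sketch, and the toy simply reports the count seen so far rather than the extrapolated estimate Spark computes. It only shows why reading the initial value returns at the timeout while asking for the final value blocks until every task is done.

```python
import threading
import time


class ToyPartialResult:
    """Hypothetical stand-in: an early estimate plus a blocking final value."""

    def __init__(self, initial_value, done_event, state):
        self._initial_value = initial_value
        self._done = done_event
        self._state = state

    def initial_value(self):
        # Returns immediately with the snapshot taken when the wait ended.
        return self._initial_value

    def get_final_value(self):
        # Blocks until all tasks have finished, however long that takes.
        self._done.wait()
        return self._state["count"]


def toy_count_approx(partitions, timeout_s, task_delay_s):
    """Count elements across partitions, waiting at most timeout_s seconds."""
    state = {"count": 0}
    lock = threading.Lock()
    done = threading.Event()

    def run_tasks():
        for part in partitions:
            time.sleep(task_delay_s)  # simulate a slow task
            with lock:
                state["count"] += len(part)
        done.set()

    threading.Thread(target=run_tasks, daemon=True).start()
    done.wait(timeout_s)  # stop waiting at the timeout or on completion
    with lock:
        snapshot = state["count"]
    return ToyPartialResult(snapshot, done, state)


partitions = [[1, 2, 3]] * 4

# Slow job: the timeout fires long before the tasks finish.
slow = toy_count_approx(partitions, timeout_s=0.05, task_delay_s=0.2)
early = slow.initial_value()      # available right after the timeout
final = slow.get_final_value()    # blocks until all four tasks finish

# Fast job: everything completes within the timeout, so the result is exact.
fast = toy_count_approx(partitions, timeout_s=5, task_delay_s=0)
exact = fast.initial_value()
```

In this toy, the slow case mirrors the timeout-aware path and the fast case mirrors the exact-result path that the regression tests are described as covering.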

Why are the changes needed?

Spark approximate actions are designed to return partial results after the specified timeout. Scala APIs correctly expose this behavior through PartialResult, but PySpark currently forces blocking completion by calling getFinalValue(). As a result, PySpark countApprox() ignores timeout semantics and waits for full completion.

Does this PR introduce any user-facing change?

Yes. PySpark approximate RDD actions now respect the timeout and return a timeout-aware approximate result instead of blocking until the job fully completes.

How was this patch tested?

  • Reproduced the issue locally using large RDDs
  • Verified timeout behavior before and after the fix
  • Added regression tests in python/pyspark/tests/test_rdd.py
  • Ran: python/run-tests.py --testnames pyspark.tests.test_rdd

Was this patch authored or co-authored using generative AI tooling?

No
